Models based on deep convolutional networks have dominated recent imageinterpretation tasks; we investigate whether models which are also recurrent,or "temporally deep", are effective for tasks involving sequences, visual andotherwise. We develop a novel recurrent convolutional architecture suitable forlarge-scale visual learning which is end-to-end trainable, and demonstrate thevalue of these models on benchmark video recognition tasks, image descriptionand retrieval problems, and video narration challenges. In contrast to currentmodels which assume a fixed spatio-temporal receptive field or simple temporalaveraging for sequential processing, recurrent convolutional models are "doublydeep"' in that they can be compositional in spatial and temporal "layers". Suchmodels may have advantages when target concepts are complex and/or trainingdata are limited. Learning long-term dependencies is possible whennonlinearities are incorporated into the network state updates. Long-term RNNmodels are appealing in that they directly can map variable-length inputs(e.g., video frames) to variable length outputs (e.g., natural language text)and can model complex temporal dynamics; yet they can be optimized withbackpropagation. Our recurrent long-term models are directly connected tomodern visual convnet models and can be jointly trained to simultaneously learntemporal dynamics and convolutional perceptual representations. Our resultsshow such models have distinct advantages over state-of-the-art models forrecognition or generation which are separately defined and/or optimized.
展开▼